1 Introduction to Survival Analysis

Configuring R

Functions from these packages will be used throughout this document:

[R code]
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(haven) # import Stata files
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(parameters) # format model output tables for
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

[R code]
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 6)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default
legend_text_size = 9
run_graphs = TRUE

1.1 Overview

Time-to-event outcomes

Survival analysis is a framework for modeling time-to-event outcomes. It is used in:

  • clinical trials, where the event is often death or recurrence of disease.
  • engineering reliability analysis, where the event is failure of a device or system.
  • insurance, particularly life insurance, where the event is death.

Note

The term survival analysis is a bit misleading. Survival outcomes can sometimes be analyzed using binomial models (logistic regression). Time-to-event models or survival time analysis might be a better name.

1.2 Time-to-event outcome distributions

Distributions of Time-to-Event Data

  • The distribution of event times is asymmetric and can be long-tailed, and starts at 0 (that is, \(P(T<0) = 0\)).
  • The base distribution is not normal, but exponential.
  • There are usually censored observations, which are ones in which the failure time is not observed.
  • Often, these are right-censored, meaning that we know that the event occurred after some known time \(t\), but we don’t know the actual event time, as when a patient is still alive at the end of the study.
  • Observations can also be left-censored, meaning we know the event has already happened at time \(t\), or interval-censored, meaning that we only know that the event happened between times \(t_1\) and \(t_2\).
  • Analysis is difficult if censoring is associated with treatment.

Right Censoring

  • Patients are in a clinical trial for cancer, some on a new treatment and some on standard of care.
  • Some patients in each group have died by the end of the study. We know the survival time (measured for example from time of diagnosis—each person on their own clock).
  • Patients still alive at the end of the study are right censored.
  • Patients who are lost to follow-up or withdraw from the study may be right-censored.

Left and Interval Censoring

  • An individual tests positive for HIV.
  • If the event is infection with HIV, then we only know that it has occurred before the testing time \(t\), so this is left censored.
  • If an individual has a negative HIV test at time \(t_1\) and a positive HIV test at time \(t_2\), then the infection event is interval censored.

1.3 Distribution functions for time-to-event variables

The Probability Density Function (PDF)

For a time-to-event variable \(T\) with a continuous distribution, the probability density function is defined as usual (see probability density function).

Example 1 (exponential distribution) Recall from Epi 202: the pdf of the exponential distribution family of models is:

\[p(T=t) = \text{1}_{t \ge 0} \cdot \lambda \text{e}^{-\lambda t}\]

where \(\lambda > 0\).


Here are some examples of exponential pdfs:

The Cumulative Distribution Function (CDF)

The cumulative distribution function is defined as:

\[ \begin{aligned} F(t) &\stackrel{\text{def}}{=}\Pr(T \le t)\\ &=\int_{u=-\infty}^t f(u) du \end{aligned} \]

Example 2 (exponential distribution) Recall from Epi 202: the cdf of the exponential distribution family of models is:

\[ P(T\le t) = \mathbb{1}_{t \ge 0} \cdot (1- \text{e}^{-\lambda t}) \] where \(\lambda > 0\).

Here are some examples of exponential cdfs:

The Survival Function

Definition 1 (Survival function)  

\[\text{S}(t) \stackrel{\text{def}}{=}\Pr(T > t)\]

Theorem 1 \[ \begin{aligned} \text{S}(t) &\stackrel{\text{def}}{=}\Pr(T > t)\\ &=\int_{u=t}^\infty p(u) du\\ &=1-F(t)\\ \end{aligned} \]

Example 3 (exponential distribution) Since \(\text{S}(t) = 1 - F(t)\), the survival function of the exponential distribution family of models is:

\[ P(T> t) = \left\{ {{\text{e}^{-\lambda t}, t\ge0} \atop {1, t \le 0}}\right. \] where \(\lambda > 0\).

Figure 1 shows some examples of exponential survival functions.

[R code]
library(ggplot2)
ggplot() +
  geom_function(
    aes(col = "0.5"),
    fun = pexp,
    args = list(lower = FALSE, rate = 0.5)
  ) +
  geom_function(
    aes(col = "p = 1"),
    fun = pexp,
    args = list(lower = FALSE, rate = 1)
  ) +
  geom_function(
    aes(col = "p = 1.5"),
    fun = pexp,
    args = list(lower = FALSE, rate = 1.5)
  ) +
  geom_function(
    aes(col = "p = 5"),
    fun = pexp,
    args = list(lower = FALSE, rate = 5)
  ) +
  theme_bw() +
  ylab("S(t)") +
  guides(col = guide_legend(title = expr(lambda))) +
  xlab("Time (t)") +
  xlim(0, 2.5) +
  theme(
    legend.position = "bottom",
    axis.title.x =
      element_text(
        angle = 0,
        vjust = 1,
        hjust = 1
      ),
    axis.title.y =
      element_text(
        angle = 0,
        vjust = 1,
        hjust = 1
      )
  )
Figure 1: Exponential Survival Functions

Theorem 2 If \(A_t\) represents survival status at time \(t\), with \(A_t = 1\) denoting alive at time \(t\) and \(A_t = 0\) denoting deceased at time \(t\), then:

\[\text{S}(t) = \Pr(A_t=1) = \text{E}[A_t]\]

Theorem 3 If \(T\) is a nonnegative random variable, then:

\[\text{E}[T] = \int_{t=0}^{\infty} \text{S}(t)dt\]

The Hazard Function

Another important quantity is the hazard function:

Definition 2 (Hazard function, hazard rate, hazard rate function)  

\[{\lambda}(t) \stackrel{\text{def}}{=}\text{p}(T=t|T\ge t)\]

Definition 3 (Incidence rate) Given a population of \(N\) individuals indexed by \(i\), each with their own hazard rate \({\lambda}_i(t)\), the incidence rate for that population is the mean hazard rate:

\[\bar{{\lambda}}(t) \stackrel{\text{def}}{=}\frac{1}{N}\sum_{i=1}^N{\lambda}_i(t)\]

Theorem 4 (Incidence rate in a homogenous population) If a population of individuals indexed by \(i\) all have identical hazard rates \({\lambda}_i(t) = {\lambda}(t)\), then the incidence rate for that population is equal to the hazard rate:

\[\bar{{\lambda}}(t) = {\lambda}(t)\]

Lemma 1 (Joint probability of a variable with itself) \[p(T=t, T\ge t) = p(T=t)\]

Proof. Recall from Epi 202: if \(A\) and \(B\) are statistical events and \(A\subseteq B\), then \(p(A, B) = p(A)\). In particular, \(\{T=t\} \subseteq \{T\geq t\}\), so \(p(T=t, T\ge t) = p(T=t)\).

Theorem 5 (Hazard equals density over survival) \[{\lambda}(t)=\frac{f(t)}{\text{S}(t)}\]

Proof. \[ \begin{aligned} {\lambda}(t) &=p(T=t|T\ge t)\\ &=\frac{p(T=t, T\ge t)}{p(T \ge t)}\\ &=\frac{p(T=t)}{p(T \ge t)}\\ &=\frac{f(t)}{\text{S}(t)} \end{aligned} \]

Example 4 (exponential distribution) The hazard function of the exponential distribution family of models is:

\[ \begin{aligned} P(T=t|T \ge t) &= \frac{f(t)}{\text{S}(t)}\\ &= \frac{\mathbb{1}_{t \ge 0}\cdot \lambda \text{e}^{-\lambda t}}{\text{e}^{-\lambda t}}\\ &=\mathbb{1}_{t \ge 0}\cdot \lambda \end{aligned} \] Figure 2 shows some examples of exponential hazard functions.

Figure 2: Examples of hazard functions for exponential distributions

We can also view the hazard function as the derivative of the negative of the logarithm of the survival function:

Theorem 6 (transform survival to hazard) \[{\lambda}(t) = \frac{\partial}{\partial t}{\left\{-\log{\text{S}(t)}\right\}}\]

Proof. \[ \begin{aligned} {\lambda}(t) &= \frac{f(t)}{\text{S}(t)}\\ &= \frac{-S'(t)}{\text{S}(t)}\\ &= -\frac{S'(t)}{\text{S}(t)}\\ &=-\frac{\partial}{\partial t}\log{\text{S}(t)}\\ &=\frac{\partial}{\partial t}{\left\{-\log{\text{S}(t)}\right\}} \end{aligned} \]

Definition 4 (hazard ratio) \[\theta(t| \tilde{x}: {\tilde{x}^*}) \stackrel{\text{def}}{=}\frac{{\lambda}(t|\tilde{x})}{{\lambda}(t|{\tilde{x}^*})}\]

The Cumulative Hazard Function

Since \({\lambda}(t) = \frac{\partial}{\partial t}{\left\{-\log{\text{S}(t)}\right\}}\) (see Theorem 6), we also have:

Corollary 1 \[\text{S}(t) = \text{exp}{\left\{-\int_{u=0}^t {\lambda}(u)du\right\}} \tag{1}\]

Definition 5 (cumulative hazard) The cumulative hazard function, often denoted \(\Lambda(t)\) or \(\text{H}(t)\), is defined as:

\[{\Lambda}(t) \stackrel{\text{def}}{=}\int_{u=0}^t {\lambda}(u) du\]

As we will see below, \({\Lambda}(t)\) is tractable to estimate, and we can then derive an estimate of the hazard function using an approximate derivative of the estimated cumulative hazard.

Example 5 The cumulative hazard function for the exponential distribution with rate parameter \(\lambda\) is:

\[ {\Lambda}(t) = \mathbb{1}_{t \ge 0}\cdot \lambda t \]

Figure 3 shows some examples of exponential cumulative hazard functions.

Figure 3: Examples of exponential cumulative hazard functions

Some Key Mathematical Relationships among Survival Concepts

Diagram:

\[ \text{f}(t) \xleftarrow[\text{S}(t){\lambda}(t)]{-S'(t)} \text{S}(t) \xleftarrow[]{\text{exp}{\left\{-{\Lambda}(t)\right\}}} {\Lambda}(t) \xleftarrow[]{\int_{u=0}^t {\lambda}(u)du} {\lambda}(t) \xleftarrow[]{\text{exp}{\left\{\eta(t)\right\}}} \eta(t) \]

\[ \text{f}(t) \xrightarrow[\int_{u=t}^\infty \text{f}(u)du]{\text{f}(t)/{\lambda}(t)} \text{S}(t) \xrightarrow[-\log{\text{S}(t)}]{} {\Lambda}(t) \xrightarrow[{\Lambda}'(t)]{} {\lambda}(t) \xrightarrow[\text{log}{\left\{{\lambda}(t)\right\}}]{} \eta(t) \]

Identities:

\[ \begin{aligned} \text{S}(t) &= 1 - \text{F}(t)\\ &= \text{exp}{\left\{-{\Lambda}(t)\right\}}\\ \text{S}'(t) &= -f(t)\\ {\Lambda}(t) &= -\text{log}{\left\{\text{S}(t)\right\}}\\ {\Lambda}'(t) &= {\lambda}(t)\\ {\lambda}(t) &= \frac{\text{f}(t)}{\text{S}(t)}\\ &= -\frac{\partial}{\partial t}\log{\text{S}(t)} \\ f(t) &= {\lambda}(t)\cdot \text{S}(t)\\ \end{aligned} \]

Some proofs (others left as exercises):

\[ \begin{aligned} S'(t) &= \frac{\partial}{\partial t}(1-F(t))\\ &= -F'(t)\\ &= -f(t)\\ \end{aligned} \]

\[ \begin{aligned} \frac{\partial}{\partial t}\log{\text{S}(t)} &= \frac{S'(t)}{\text{S}(t)}\\ &= -\frac{f(t)}{\text{S}(t)}\\ &= -{\lambda}(t)\\ \end{aligned} \]

\[ \begin{aligned} {\Lambda}(t) &\stackrel{\text{def}}{=}\int_{u=0}^t h(u) du\\ &= \int_0^t -\frac{\partial}{\partial u}\text{log}\left\{S(u)\right\} du\\ &= \left[-\text{log}\left\{S(u)\right\}\right]_{u=0}^{u=t}\\ &= \left[\text{log}\left\{S(u)\right\}\right]_{u=t}^{u=0}\\ &= \text{log}{\left\{\text{S}(0)\right\}} - \text{log}{\left\{\text{S}(t)\right\}}\\ &= \text{log}{\left\{1\right\}} - \text{log}{\left\{\text{S}(t)\right\}}\\ &= 0 - \text{log}{\left\{\text{S}(t)\right\}}\\ &=-\text{log}{\left\{\text{S}(t)\right\}} \end{aligned} \]

Corollary:

\[\text{S}(t) = \text{exp}\left\{-{\Lambda}(t)\right\}\]

Example: Time to death the US in 2004

The first day is the most dangerous:

[R code]
# download `survexp.rda` from:
# paste0(
# "https://github.com/therneau/survival/raw/",
# "f3ac93704949ff26e07720b56f2b18ffa8066470/",
# "Data/survexp.rda")

# (newer versions of `survival` don't have the first-year breakdown; see:
# https://cran.r-project.org/web/packages/survival/news.html)

fs::path(
  here::here(),
  "Data",
  "survexp.rda"
) |>
  load()
s1 <- survexp.us[, "female", "2004"]
age1 <- c(
  0.5 / 365.25,
  4 / 365.25,
  17.5 / 365.25,
  196.6 / 365.25,
  1:109 + 0.5
)
s2 <- 365.25 * s1[5:113]
s2 <- c(s1[1], 6 * s1[2], 22 * s1[3], 337.25 * s1[4], s2)
cols <- rep(1, 113)
cols[1] <- 2
cols[2] <- 3
cols[3] <- 4

plot(age1, s1, type = "b", lwd = 2, xlab = "Age", ylab = "Daily Hazard Rate", col = cols)

text(10, .003, "First Day", col = 2)
text(18, .00030, "Rest of First Week", col = 3)
text(18, .00015, "Rest of First month", col = 4)
Figure 4: Daily Hazard Rates in 2004 for US Females

Exercise 1 Hypothesize why the male and female hazard functions in Figure 5 differ where they do?

[R code]
yrs <- 1:40
s1 <- survexp.us[5:113, "male", "2004"]
s2 <- survexp.us[5:113, "female", "2004"]

age1 <- 1:109

plot(age1[yrs], s1[yrs], type = "l", lwd = 2, xlab = "Age", ylab = "Daily Hazard Rate")
lines(age1[yrs], s2[yrs], col = 2, lwd = 2)
legend(5, 5e-6, c("Males", "Females"), col = 1:2, lwd = 2)
Figure 5: Daily Hazard Rates in 2004 for US Males and Females 1-40

Exercise 2 Compare and contrast Figure 6 with Figure 4.

[R code]
s1 <- survexp.us[, "female", "2004"]

s2 <- 365.25 * s1[5:113]
s2 <- c(s1[1], 6 * s1[2], 21 * s1[3], 337.25 * s1[4], s2)
cs2 <- cumsum(s2)
age2 <- c(1 / 365.25, 7 / 365.25, 28 / 365.25, 1:110)
plot(age2, exp(-cs2), type = "l", lwd = 2, xlab = "Age", ylab = "Survival")
Figure 6: Survival Curve in 2004 for US Females

Likelihood with censoring

If an event time \(T\) is observed exactly as \(T=t\), then the likelihood of that observation is just its probability density function:

\[ \begin{aligned} \mathscr{L}(t) &= \text{f}(T=t)\\ &\stackrel{\text{def}}{=}\text{f}_T(t)\\ &= {\lambda}_T(t)\text{S}_T(t)\\ \ell(t) &\stackrel{\text{def}}{=}\text{log}{\left\{\mathscr{L}(t)\right\}}\\ &= \text{log}{\left\{{\lambda}_T(t) \text{S}_T(t)\right\}}\\ &= \text{log}{\left\{{\lambda}_T(t)\right\}} + \text{log}{\left\{\text{S}_T(t)\right\}}\\ &= \text{log}{\left\{{\lambda}_T(t)\right\}} - {\Lambda}_T(t)\\ \end{aligned} \]

If instead the event time \(T\) is censored and only known to be after time \(y\), then the likelihood of that censored observation is instead the survival function evaluated at the censoring time:

\[ \begin{aligned} \mathscr{L}(y) &=p_T(T>y)\\ &\stackrel{\text{def}}{=}\text{S}_T(y)\\ \ell(y) &\stackrel{\text{def}}{=}\text{log}{\left\{\mathscr{L}(y)\right\}}\\ &=\text{log}{\left\{\text{S}(y)\right\}}\\ &=-{\Lambda}(y)\\ \end{aligned} \]

\[ \begin{aligned} \mathscr{L}(y,d) &= \text{p}(Y=y, D=d)\\ &= {\left[\text{p}(T=y,C> y)\right]}^d \cdot {\left[\text{p}(T>y,C=y)\right]}^{1-d}\\ \end{aligned} \]

\[ \begin{aligned} \mathscr{L}(y,d) &= {\left[\text{p}(T=y,C> y)\right]}^d\cdot {\left[\text{p}(T>y,C=y)\right]}^{1-d}\\ &= {\left[\text{p}(T=y)\text{p}(C> y)\right]}^d\cdot {\left[\text{p}(T>y)\text{p}(C=y)\right]}^{1-d}\\ &= {\left[\text{f}_T(y)\text{S}_C(y)\right]}^d\cdot {\left[\text{S}(y)\text{f}_C(y)\right]}^{1-d}\\ &= {\left[\text{f}_T(y)^d \text{S}_C(y)^d\right]}\cdot {\left[\text{S}_T(y)^{1-d} \text{f}_C(y)^{1-d}\right]}\\ &= {\left(\text{f}_T(y)^d \cdot \text{S}_T(y)^{1-d}\right)}\cdot {\left(\text{f}_C(y)^{1-d} \cdot \text{S}_C(y)^{d}\right)} \end{aligned} \]

\[ \begin{aligned} \ell(y,d) &= \text{log}{\left\{\mathscr{L}(y,d)\right\}}\\ &= \text{log}{\left\{ {\left(f_T(y)^d \cdot S_T(y)^{1-d}\right)} \cdot {\left(f_C(y)^{1-d} \cdot S_C(y)^{d}\right)} \right\}} \\ &= \text{log}{\left\{f_T(y)^d \cdot S_T(y)^{1-d}\right\}} + \text{log}{\left\{f_C(y)^{1-d} \cdot S_C(y)^{d}\right\}} \end{aligned} \] Let

  • \(\theta_T\) represent the parameters of \(p_T(t)\),
  • \(\theta_C\) represent the parameters of \(p_C(c)\),
  • \(\theta = (\theta_T, \theta_C)\) be the combined vector of all parameters.

\[ \begin{aligned} \ell'(y,d) &= \frac{\partial}{\partial \theta} {\left[ \text{log}{\left\{f_T(y)^d \cdot S_T(y)^{1-d}\right\}} + \text{log}{\left\{f_C(y)^{1-d} \cdot S_C(y)^{d}\right\}} \right]} \\ &= {\left( \frac{\partial}{\partial \theta} \text{log}{\left\{ f_T(y)^d \cdot S_T(y)^{1-d} \right\}} \right)} + {\left( \frac{\partial}{\partial \theta} \text{log}{\left\{ f_C(y)^{1-d} \cdot S_C(y)^{d} \right\}} \right)} \end{aligned} \]

\[ \begin{aligned} \ell'_{\theta_T}(y,d) &\stackrel{\text{def}}{=}\frac{\partial}{\partial \theta_T}\ell(y,d)\\ &= \left( \frac{\partial}{\partial \theta_T} \text{log}\left\{ f_T(y)^d \cdot S_T(y)^{1-d} \right\} \right) + \left( \frac{\partial}{\partial \theta_T} \text{log}\left\{ f_C(y)^{1-d} \cdot S_C(y)^{d} \right\} \right)\\ &= \left( \frac{\partial}{\partial \theta_T} \text{log}\left\{ f_T(y)^d \cdot S_T(y)^{1-d} \right\} \right) + 0\\ &= \frac{\partial}{\partial \theta_T} \text{log}\left\{ f_T(y)^d \cdot S_T(y)^{1-d} \right\}\\ \end{aligned} \]

Then:

\[ \begin{aligned} \mathscr{L}(y,d) &= f_T(y)^d \cdot S_T(y)^{1-d}\\ &= \left(h_T(y)^d S_T(y)^d\right) \cdot S_T(y)^{1-d}\\ &= h_T(y)^d \cdot S_T(y)^d \cdot S_T(y)^{1-d}\\ &= h_T(y)^d \cdot S_T(y)\\ &= S_T(y) \cdot h_T(y)^d \\ \end{aligned} \]

\[ \begin{aligned} \ell(y,d) &=\text{log}\left\{\mathscr{L}(y,d)\right\}\\ &= \text{log}\left\{S_T(y) \cdot h_T(y)^d\right\}\\ &= \text{log}\left\{S_T(y)\right\} + \text{log}\left\{h_T(y)^d\right\}\\ &= \text{log}\left\{S_T(y)\right\} + d\cdot \text{log}\left\{h_T(y)\right\}\\ &= -H_T(y) + d\cdot \text{log}\left\{h_T(y)\right\}\\ \end{aligned} \]

1.4 Parametric Models for Time-to-Event Outcomes

Exponential Distribution

  • The exponential distribution is the base distribution for survival analysis.
  • The distribution has a constant hazard \(\lambda\)
  • The mean survival time is \(\lambda^{-1}\)

Mathematical details of exponential distribution

\[ \begin{aligned} f(t) &= \lambda \text{e}^{-\lambda t} \\ F(t) &= 1-\text{e}^{-\lambda t}\\ \text{S}(t)&= \text{e}^{-\lambda t}\\ \ln(\text{S}(t))&=-\lambda t\\ {\lambda}(t) &= -\frac{f(t)}{\text{S}(t)} = -\frac{\lambda \text{e}^{-\lambda t}}{\text{e}^{-\lambda t}}=\lambda \\ E(t) &= \lambda^{-1} \\ Var(t) &= \lambda^{-2} \\ \text{log}{\left\{f(t)\right\}} &= \text{log}{\left\{\lambda\right\}} - \lambda t \\ \frac{\partial}{\partial \lambda} \text{log}{\left\{f(t)\right\}} &= \lambda^{-1} - t \\ &= \text{E}{\left[t\right]} - t \\ &= -(\text{E}{\left[t\right]} - t) \\ &= -\varepsilon \end{aligned} \]

Prediction intervals for time-to-event outcomes

Exercise 3 (Construct a prediction interval) Suppose a cancer patient is predicted to have an expected (mean) lifetime of 7 years after diagnosis, and suppose the distribution is exponential.

Construct a 95% prediction interval for survival.

Tip

Use the quantiles of the exponential distribution.

Solution 1. If the mean is 7 years until death, then the rate parameter \(\lambda = 1/7\) events (deaths) per year.

The 0.025 quantile of the exponential distribution with \(\lambda = 1/7\) is qexp(p 0.025, rate = 1/7) = 0.177225 and the 0.975 quantile is qexp(p 0.975, rate = 1/7) = 25.822156, so the prediction interval is qexp(p c(.025, 0.975), rate = 1/7) = (0.177225, 25.822156).

Exercise 4 Graph the prediction interval as a function of the mean, for Gaussian (\(\sigma = 1\)), Binomial, Poisson, and Exponential.

Solution 2. Left to the reader for now.

Exercise 5 (Explain the results) Why do time-to-event models have such wide predictive intervals?

Tip

Consider the relationship between the mean, variance, and standard deviation of the exponential distribution, and contrast that relationship with the Poisson distribution and the Bernoulli distribution.

Solution 3. In the exponential distribution, variance is the square of the mean (hence SD is equal to mean); as opposed to Poisson, where variance was equal to the mean (and SD is the square-root of the mean), or the Bernoulli, where the variance is the mean minus the square of the mean (so the SD is smaller than the square-root of the mean).

Estimating \(\lambda\)

  • Suppose we have \(m\) exponential survival times of \(t_1, t_2,\ldots,t_m\) and \(k\) right-censored values at \(u_1,u_2,\ldots,u_k\).

  • A survival time of \(t_i=10\) means that subject \(i\) died at time 10. A right-censored time \(u_i=10\) means that at time 10, subject \(i\) was still alive and that we have no further follow-up.

  • For the moment we will assume that the survival distribution is exponential and that all the subjects have the same parameter \(\lambda\).

We have \(m\) exponential survival times of \(t_1, t_2,\ldots,t_m\) and \(k\) right-censored values at \(u_1,u_2,\ldots,u_k\). The log-likelihood of an observed survival time \(t_i\) is \[ \text{log}\left\{\lambda \text{e}^{-\lambda t_i}\right\} = \text{log}\left\{\lambda\right\}-\lambda t_i \] and the likelihood of a censored value is the probability of that outcome (survival greater than \(u_j\)) so the log-likelihood is

\[ \begin{aligned} \ell_j(\lambda) &= \text{log}\left\{\lambda \text{e}^{u_j}\right\} \\ &= -\lambda u_j \end{aligned} \]

Theorem 7 Let \(T=\sum t_i\) and \(U=\sum u_j\). Then:

\[ \hat{\lambda}_{ML} = \frac{m}{T+U} \tag{2}\]

Proof. \[ \begin{aligned} \ell(\lambda) &= \sum_{i=1}^m( \ln \lambda-\lambda t_i) + \sum_{j=1}^k (-\lambda u_j)\\ &= m \ln \lambda -(T+U)\lambda\\ \ell'(\lambda) &=m\lambda^{-1} -(T+U)\\ \hat{\lambda} &= \frac{m}{T+U} \end{aligned} \]

\[ \begin{aligned} \ell''&=-m/\lambda^2\\ &< 0\\ \hat E[T] &= \hat\lambda^{-1}\\ &= \frac{T+U}{m} \end{aligned} \]

Fisher Information and Standard Error

\[ \begin{aligned} E[-\ell''] & = m/\lambda^2\\ \text{Var}{\left(\hat\lambda\right)} &\approx {\left(E[-\ell'']\right)}^{-1}\\ &=\lambda^2/m\\ \text{SE}\left(\hat\lambda\right) &= \sqrt{\text{Var}{\left(\hat\lambda\right)}}\\ &\approx \lambda/\sqrt{m} \end{aligned} \]

Other Parametric Survival Distributions

  • Any density on \([0,\infty)\) can be a survival distribution, but the most useful ones are all skew right.
  • The most frequently used generalization of the exponential is the Weibull.
  • Other common choices are the gamma, log-normal, log-logistic, Gompertz, inverse Gaussian, and Pareto.
  • Most of what we do going forward is non-parametric or semi-parametric, but sometimes these parametric distributions provide a useful approach.

1.5 Nonparametric Survival Analysis

Basic ideas

  • Mostly, we work without a parametric model.

  • The first task is to estimate a survival function from data listing survival times, and censoring times for censored data.

  • For example one patient may have relapsed at 10 months. Another might have been followed for 32 months without a relapse having occurred (censored).

  • The minimum information we need for each patient is a time and a censoring variable which is 1 if the event occurred at the indicated time and 0 if this is a censoring time.

1.6 Example: clinical trial for pediatric acute leukemia

Overview of study

This is from a clinical trial in 1963 for 6-MP treatment vs. placebo for Acute Leukemia in 42 children.

  • Pairs of children:

  • matched by remission status at the time of treatment (remstat: 1 = partial, 2 = complete)

  • randomized to 6-MP (exit times in t2) or placebo (exit times in t1)

  • Followed until relapse or end of study.

  • All of the placebo group relapsed, but some of the 6-MP group were censored (which means they were still in remission); indicated by relapse variable (0 = censored, 1 = relapse).

  • 6-MP = 6-Mercaptopurine (Purinethol) is an anti-cancer (“antineoplastic” or “cytotoxic”) chemotherapy drug used currently for Acute lymphoblastic leukemia (ALL). It is classified as an antimetabolite.

Study design

  • Clinical trial in 1963 for 6-MP treatment vs. placebo for Acute Leukemia in 42 children.
  • Pairs of children:
  • matched by remission status at the time of treatment (remstat)
  • remstat = 1: partial
  • remstat = 2: complete
  • randomized to 6-MP (exit time: t2) or placebo (t1).
  • Followed until relapse or end of study.
  • All of the placebo group relapsed,
  • Some of the 6-MP group were censored.
Table 1: drug6mp pediatric acute leukemia data
[R code]
library(KMsurv)
data(drug6mp)
drug6mp <- drug6mp |>
  tibble::as_tibble() |>
  print()
#> # A tibble: 21 × 5
#>     pair remstat    t1    t2 relapse
#>    <int>   <int> <int> <int>   <int>
#>  1     1       1     1    10       1
#>  2     2       2    22     7       1
#>  3     3       2     3    32       0
#>  4     4       2    12    23       1
#>  5     5       2     8    22       1
#>  6     6       1    17     6       1
#>  7     7       2     2    16       1
#>  8     8       2    11    34       0
#>  9     9       2     8    32       0
#> 10    10       2    12    25       0
#> # ℹ 11 more rows

Data documentation for drug6mp

[R code]
# library(printr) # inserts help-file output into markdown output
library(KMsurv)
?drug6mp

Descriptive Statistics

Table 2: Summary statistics for drug6mp data
[R code]
summary(drug6mp)
#>       pair       remstat           t1              t2          relapse     
#>  Min.   : 1   Min.   :1.00   Min.   : 1.00   Min.   : 6.0   Min.   :0.000  
#>  1st Qu.: 6   1st Qu.:2.00   1st Qu.: 4.00   1st Qu.: 9.0   1st Qu.:0.000  
#>  Median :11   Median :2.00   Median : 8.00   Median :16.0   Median :0.000  
#>  Mean   :11   Mean   :1.76   Mean   : 8.67   Mean   :17.1   Mean   :0.429  
#>  3rd Qu.:16   3rd Qu.:2.00   3rd Qu.:12.00   3rd Qu.:23.0   3rd Qu.:1.000  
#>  Max.   :21   Max.   :2.00   Max.   :23.00   Max.   :35.0   Max.   :1.000

Exponential model

\[\hat\lambda = \frac{\sum_{i=1}^nD_i}{\sum_{i=1}^nY_i}\]

\[ \begin{aligned} \hat \lambda_{\text{placebo}} &= \frac{\sum_{i=1}^nD_i}{\sum_{i=1}^nY_i} \\ &= \frac{\sum_{i=1}^n1}{\sum_{i=1}^nY_i} \\ &= \frac{n}{\sum_{i=1}^nY_i} \\ &= \frac{1}{\bar{Y}} \\ &= \frac{1}{8.666667} \\ &= 0.115385 \end{aligned} \]

  • For the 6-MP group, \(\hat\lambda = 9/359 = 0.025\)

\[ \begin{aligned} \hat \lambda_{\text{6-MP}} &= \frac{\sum_{i=1}^nD_i}{\sum_{i=1}^nY_i} \\ &= \frac{9}{359} \\ &= 0.02507 \end{aligned} \]

  • The estimated hazard in the placebo group is 4.6 times as large as in the 6-MP group (assuming the hazard is constant over time).

1.7 The Kaplan-Meier Product Limit Estimator

Estimating survival in datasets without censoring

Estimating survival in datasets with censoring

  • For the 6-MP patients, we cannot ignore the censored data because we know that the time to relapse is greater than the censoring time.

  • For any time \(t\) in months, we know that 6-MP patients with times greater than \(t\) have not relapsed, and those with relapse time less than \(t\) have relapsed, but we don’t know if patients with censored time less than \(t\) have relapsed or not.

  • The procedure we usually use is the Kaplan-Meier product-limit estimator of the survival function.

  • The Kaplan-Meier estimator is a step function (like the empirical cdf), which changes value only at the event times, not at the censoring times.

  • At each event time \(t\), we compute the at-risk group size \(Y\), which is all those observations whose event time or censoring time is at least \(t\).

  • If \(d\) of the observations have an event time (not a censoring time) of \(t\), then the group of survivors immediately following time \(t\) is reduced by the fraction \[\frac{Y-d}{Y}=1-\frac{d}{Y}\]

Definition 6 (Kaplan-Meier Product-Limit Estimator of Survival Function) If a time-to-event data set contains \(k\) event times \(t_i\), (\(i \in {1:k}\)), where \(n_i\) is the number of individuals at risk at time \(t_i\) and \(d_i\) is the number of events at time \(t_i\), then the Kaplan-Meier Product-Limit Estimator of the survival function is:

\[\hat{\lambda}_i = \frac{d_i}{n_i}\]

\[\hat{\text{S}}_{KM}(t) \stackrel{\text{def}}{=}\prod_{{\left\{i:\ t_i < t\right\}}} {\left[1-\hat{\lambda}_i\right]}\]

Theorem 8 (Kaplan-Meier Estimate with No Censored Observations) If there are no censored data, and there are \(n\) data points, then just after (say) the third event time

\[ \begin{aligned} \hat{\text{S}}(t) &= \prod_{t_i < t}{\left[1-\frac{d_i}{Y_i}\right]} \\ &= {\left[\frac{n-d_1}{n}\right]} {\left[\frac{n-d_1-d_2}{n-d_1}\right]} {\left[\frac{n-d_1-d_2-d_3}{n-d_1-d_2}\right]} \\ &= \frac{n-d_1-d_2-d_3}{n} \\ &=1-\frac{d_1+d_2+d_3}{n} \\ &=1-\hat F(t) \end{aligned} \]

where \(\hat F(t)\) is the usual empirical CDF estimate.

Kaplan-Meier curve for drug6mp data

Here is the Kaplan-Meier estimated survival curve for the patients who received 6-MP in the drug6mp dataset (we will see code to produce figures like this one shortly):

[R code]
# | echo: false

require(KMsurv)
data(drug6mp)
library(dplyr)
library(survival)

drug6mp_km_model1 <-
  drug6mp |>
  mutate(surv = Surv(t2, relapse)) |>
  survfit(formula = surv ~ 1, data = _)

library(ggfortify)
drug6mp_km_model1 |>
  autoplot(
    mark.time = TRUE,
    conf.int = FALSE
  ) +
  expand_limits(y = 0) +
  xlab("Time since diagnosis (months)") +
  ylab("KM Survival Curve")
Figure 7: Kaplan-Meier Survival Curve for 6-MP Patients

Kaplan-Meier calculations

Let’s compute these estimates and build the chart by hand:

[R code]
library(KMsurv)
library(dplyr)
data(drug6mp)

drug6mp.v2 <-
  drug6mp |>
  as_tibble() |>
  mutate(
    remstat = remstat |>
      case_match(
        1 ~ "partial",
        2 ~ "complete"
      ),
    # renaming to "outcome" while relabeling is just a style choice:
    outcome = relapse |>
      case_match(
        0 ~ "censored",
        1 ~ "relapsed"
      )
  )

km.6mp <-
  drug6mp.v2 |>
  summarize(
    .by = t2,
    Relapses = sum(outcome == "relapsed"),
    Censored = sum(outcome == "censored")
  ) |>
  # here we add a start time row, so the graph starts at time 0:
  bind_rows(
    tibble(
      t2 = 0,
      Relapses = 0,
      Censored = 0
    )
  ) |>
  # sort in time order:
  arrange(t2) |>
  mutate(
    Exiting = Relapses + Censored,
    `Study Size` = sum(Exiting),
    Exited = cumsum(Exiting) |> dplyr::lag(default = 0),
    `At Risk` = `Study Size` - Exited,
    Hazard = Relapses / `At Risk`,
    `KM Factor` = 1 - Hazard,
    `Cumulative Hazard` = cumsum(`Hazard`),
    `KM Survival Curve` = cumprod(`KM Factor`)
  )

library(pander)
pander(km.6mp)
t2 Relapses Censored Exiting Study Size Exited At Risk Hazard KM Factor Cumulative Hazard KM Survival Curve
0 0 0 0 21 0 21 0 1 0 1
6 3 1 4 21 0 21 0.1429 0.8571 0.1429 0.8571
7 1 0 1 21 4 17 0.05882 0.9412 0.2017 0.8067
9 0 1 1 21 5 16 0 1 0.2017 0.8067
10 1 1 2 21 6 15 0.06667 0.9333 0.2683 0.7529
11 0 1 1 21 8 13 0 1 0.2683 0.7529
13 1 0 1 21 9 12 0.08333 0.9167 0.3517 0.6902
16 1 0 1 21 10 11 0.09091 0.9091 0.4426 0.6275
17 0 1 1 21 11 10 0 1 0.4426 0.6275
19 0 1 1 21 12 9 0 1 0.4426 0.6275
20 0 1 1 21 13 8 0 1 0.4426 0.6275
22 1 0 1 21 14 7 0.1429 0.8571 0.5854 0.5378
23 1 0 1 21 15 6 0.1667 0.8333 0.7521 0.4482
25 0 1 1 21 16 5 0 1 0.7521 0.4482
32 0 2 2 21 17 4 0 1 0.7521 0.4482
34 0 1 1 21 19 2 0 1 0.7521 0.4482
35 0 1 1 21 20 1 0 1 0.7521 0.4482

Summary

For the 6-MP patients at time 6 months, there are 21 patients at risk. At \(t=6\) there are 3 relapses and 1 censored observations.

The Kaplan-Meier factor is \((21-3)/21 = 0.857\). The number at risk for the next time (\(t=7\)) is \(21-3-1=17\).

At time 7 months, there are 17 patients at risk. At \(t=7\) there is 1 relapse and 0 censored observations. The Kaplan-Meier factor is \((17-1)/17 = 0.941\). The Kaplan Meier estimate is \(0.857\times0.941=0.807\). The number at risk for the next time (\(t=9\)) is \(17-1=16\).

Now, let’s graph this estimated survival curve using ggplot():

[R code]
library(ggplot2)
conflicts_prefer(dplyr::filter)
km.6mp |>
  ggplot(aes(x = t2, y = `KM Survival Curve`)) +
  geom_step() +
  geom_point(data = km.6mp |> filter(Censored > 0), shape = 3) +
  expand_limits(y = c(0, 1), x = 0) +
  xlab("Time since diagnosis (months)") +
  ylab("KM Survival Curve") +
  scale_y_continuous(labels = scales::percent)
Figure 8: KM curve for 6MP patients, calculated by hand

1.8 Using the survival package in R

We don’t have to do these calculations by hand every time; the survival package and several others have functions available to automate many of these tasks (full list: https://cran.r-project.org/web/views/Survival.html).

The Surv function

To use the survival package, the first step is telling R how to combine the exit time and exit reason (censoring versus event) columns. The Surv() function accomplishes this task.

Example: Surv() with drug6mp data

[R code]
library(survival)
drug6mp.v3 <-
  drug6mp.v2 |>
  mutate(
    surv2 = Surv(
      time = t2,
      event = (outcome == "relapsed")
    )
  )

print(drug6mp.v3)
#> # A tibble: 21 × 7
#>     pair remstat     t1    t2 relapse outcome   surv2
#>    <int> <chr>    <int> <int>   <int> <chr>    <Surv>
#>  1     1 partial      1    10       1 relapsed    10 
#>  2     2 complete    22     7       1 relapsed     7 
#>  3     3 complete     3    32       0 censored    32+
#>  4     4 complete    12    23       1 relapsed    23 
#>  5     5 complete     8    22       1 relapsed    22 
#>  6     6 partial     17     6       1 relapsed     6 
#>  7     7 complete     2    16       1 relapsed    16 
#>  8     8 complete    11    34       0 censored    34+
#>  9     9 complete     8    32       0 censored    32+
#> 10    10 complete    12    25       0 censored    25+
#> # ℹ 11 more rows

The output of Surv() is a vector of objects with class Surv. When we print this vector:

  • observations where the event was observed are printed as the event time (for example, surv2 = 10 on line 1)

  • observations where the event was right-censored are printed as the censoring time with a plus sign (+; for example, surv2 = 32+ on line 3).

The survfit function

Once we have constructed our Surv variable, we can calculate the Kaplan-Meier estimate of the survival curve using the survfit() function.

Note

The documentation for ?survfit isn’t too helpful; the survfit.formula documentation is better.

Example: survfit() with drug6mp data

Here we use survfit() to create a survfit object, which contains the Kaplan-Meier estimate:

[R code]
drug6mp.km_model <- survfit(
  formula = surv2 ~ 1,
  data = drug6mp.v3
)

print.survfit() just gives some summary statistics:

[R code]
print(drug6mp.km_model)
#> Call: survfit(formula = surv2 ~ 1, data = drug6mp.v3)
#> 
#>       n events median 0.95LCL 0.95UCL
#> [1,] 21      9     23      16      NA

summary.survfit() shows us the underlying Kaplan-Meier table:

[R code]
summary(drug6mp.km_model)
#> Call: survfit(formula = surv2 ~ 1, data = drug6mp.v3)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     6     21       3    0.857  0.0764        0.720        1.000
#>     7     17       1    0.807  0.0869        0.653        0.996
#>    10     15       1    0.753  0.0963        0.586        0.968
#>    13     12       1    0.690  0.1068        0.510        0.935
#>    16     11       1    0.627  0.1141        0.439        0.896
#>    22      7       1    0.538  0.1282        0.337        0.858
#>    23      6       1    0.448  0.1346        0.249        0.807

We can specify which time points we want using the times argument:

[R code]
summary(
  drug6mp.km_model,
  times = c(0, drug6mp.v3$t2)
)
#> Call: survfit(formula = surv2 ~ 1, data = drug6mp.v3)
#> 
#>  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#>     0     21       0    1.000  0.0000        1.000        1.000
#>    10     15       1    0.753  0.0963        0.586        0.968
#>     7     17       1    0.807  0.0869        0.653        0.996
#>    32      4       0    0.448  0.1346        0.249        0.807
#>    23      6       1    0.448  0.1346        0.249        0.807
#>    22      7       1    0.538  0.1282        0.337        0.858
#>     6     21       3    0.857  0.0764        0.720        1.000
#>    16     11       1    0.627  0.1141        0.439        0.896
#>    34      2       0    0.448  0.1346        0.249        0.807
#>    32      4       0    0.448  0.1346        0.249        0.807
#>    25      5       0    0.448  0.1346        0.249        0.807
#>    11     13       0    0.753  0.0963        0.586        0.968
#>    20      8       0    0.627  0.1141        0.439        0.896
#>    19      9       0    0.627  0.1141        0.439        0.896
#>     6     21       3    0.857  0.0764        0.720        1.000
#>    17     10       0    0.627  0.1141        0.439        0.896
#>    35      1       0    0.448  0.1346        0.249        0.807
#>     6     21       3    0.857  0.0764        0.720        1.000
#>    13     12       1    0.690  0.1068        0.510        0.935
#>     9     16       0    0.807  0.0869        0.653        0.996
#>     6     21       3    0.857  0.0764        0.720        1.000
#>    10     15       1    0.753  0.0963        0.586        0.968
[R code]
?summary.survfit

Plotting estimated survival functions

We can plot survfit objects with plot(), autoplot(), or ggsurvplot():

[R code]
library(ggfortify)
autoplot(drug6mp.km_model)
Kaplan-Meier Survival Curve for 6-MP Patients

Kaplan-Meier Survival Curve for 6-MP Patients

[R code]

# not shown:
# plot(drug6mp.km_model)

# library(survminer)
# ggsurvplot(drug6mp.km_model)

quantiles of survival curve

We can extract quantiles with quantile():

[R code]
drug6mp.km_model |>
  quantile(p = c(.25, .5)) |>
  as_tibble() |>
  mutate(p = c(.25, .5)) |>
  relocate(p, .before = everything())

1.9 The log-rank test

(a.k.a. the Mantel-Cox test)

Exercise 6 How do we test the null hypothesis that two or more groups have the same time-to-event distribution?

Solution 4. One option is the log-rank test comparing the Kaplan-Meier estimates of the survival functions of those groups.

For \(t \in t_1,...,t_n\): \[ \begin{aligned} \hat{{\lambda}}_t &= \frac{\sum_x m_{x,t}}{\sum_x n_{x,t}} \\ \hat{E}_{t,x} &= \hat{{\lambda}}_t * n_{x,t} \end{aligned} \]

The survdiff function

[R code]
?survdiff

Example: survdiff() with drug6mp data

Now we are going to compare the placebo and 6-MP data. We need to reshape the data to make it usable with the standard survival workflow:

[R code]
library(survival)
library(tidyr)
drug6mp.v4 <-
  drug6mp.v3 |>
  select(pair, remstat, t1, t2, outcome) |>
  # here we are going to change the data from a wide format to long:
  pivot_longer(
    cols = c(t1, t2),
    names_to = "treatment",
    values_to = "exit_time"
  ) |>
  mutate(
    treatment = treatment |>
      case_match(
        "t1" ~ "placebo",
        "t2" ~ "6-MP"
      ),
    outcome = if_else(
      treatment == "placebo",
      "relapsed",
      outcome
    ),
    surv = Surv(
      time = exit_time,
      event = (outcome == "relapsed")
    )
  )

Using this long data format, we can fit a Kaplan-Meier curve for each treatment group simultaneously:

[R code]
drug6mp.km_model2 <-
  survfit(
    formula = surv ~ treatment,
    data = drug6mp.v4
  )

We can plot the curves in the same graph:

[R code]
drug6mp.km_model2 |> autoplot()

We can also perform something like a t-test, where the null hypothesis is that the curves are the same:

[R code]

o_e <- drug6mp.v4 |>
  arrange(exit_time) |>
  mutate(
    .by = treatment,
    n_exited = row_number(),
    n_at_risk = n() - n_exited + 1
  ) |>
  dplyr::summarize(
    .by = all_of(c("exit_time", "treatment")),
    n_at_risk = max(n_at_risk),
    n_events = sum(outcome == "relapsed")
  ) |>
  tidyr::pivot_wider(
    names_from = "treatment",
    values_from = c(n_at_risk, n_events)
  ) |>
  tidyr::fill(
    starts_with("n_at_risk"),
    .direction = "up"
  ) |>
  replace_na(list("n_events_placebo" = 0,
                  "n_events_6-MP" = 0)) |>
  mutate(
    n_at_risk = rowSums(across(starts_with("n_at_risk"))),
    n_events = rowSums(across(starts_with("n_events"))),
    marginal_hazard = n_events / n_at_risk,
    expected_6mp = marginal_hazard * `n_at_risk_6-MP`,
    expected_plc = marginal_hazard * n_at_risk_placebo,
    diff_6mp = `n_events_6-MP` - expected_6mp,
    diff_plc = n_events_placebo - expected_plc
  ) |>
  filter(n_events > 0)

o_e
Table 3: Observed and expected event counts for the 6-MP data, for log-rank test
[R code]
o_e_summ <- o_e |>
  summarize(
    across(starts_with("expected"), sum),
    across(starts_with("n_events_"), sum)
  )
pander::pander(o_e_summ)
Table 4: Observed and expected sums for the 6-MP data, for log-rank test
expected_6mp expected_plc n_events_placebo n_events_6-MP
19.25 10.75 21 9

\[ \text{Var}{\left(O_i - E_i\right)} = \sum_j \frac {n_{1j} n_{2j} (m_j) (n_j - m_j)} {(n_j)^2 (n_j - 1)} \]

\[X^2 \approx \sum_{i = 1}^p \frac{(O_i - E_i)^2}{E_i}\]

[R code]
with(
  o_e_summ,
  tibble(
    "6mp" = (`n_events_6-MP` - expected_6mp)^2 / expected_6mp,
    "placebo" = (n_events_placebo - expected_plc)^2 / expected_plc,
    sum = `6mp` + placebo
  )
) |>
  pander::pander()
6mp placebo sum
5.458 9.775 15.23

R gives us both the exact and approximate results:

[R code]
survdiff(
  formula = surv ~ treatment,
  data = drug6mp.v4
)
#> Call:
#> survdiff(formula = surv ~ treatment, data = drug6mp.v4)
#> 
#>                    N Observed Expected (O-E)^2/E (O-E)^2/V
#> treatment=6-MP    21        9     19.3      5.46      16.8
#> treatment=placebo 21       21     10.7      9.77      16.8
#> 
#>  Chisq= 16.8  on 1 degrees of freedom, p= 4e-05
[R code]
lrank_test <- survdiff(
  formula = surv ~ treatment + strata(pair),
  data = drug6mp.v4
)
lrank_test
#> Call:
#> survdiff(formula = surv ~ treatment + strata(pair), data = drug6mp.v4)
#> 
#>                    N Observed Expected (O-E)^2/E (O-E)^2/V
#> treatment=6-MP    21        9     16.5      3.41      10.7
#> treatment=placebo 21       21     13.5      4.17      10.7
#> 
#>  Chisq= 10.7  on 1 degrees of freedom, p= 0.001

1.10 Example: Bone Marrow Transplant Data

Data from Copelan et al. (1991)

Figure 9: Recovery process from a bone marrow transplant (Fig. 1.1 from Klein and Moeschberger (2003))

Study design

Treatment

  • allogeneic (from a donor) bone marrow transplant therapy

Inclusion criteria

  • acute myeloid leukemia (AML)
  • acute lymphoblastic leukemia (ALL).

Possible intermediate events

  • graft vs. host disease (GVHD): an immunological rejection response to the transplant
  • platelet recovery: a return of platelet count to normal levels.

One or the other, both in either order, or neither may occur.

End point events

  • relapse of the disease
  • death

Any or all of these events may be censored.

KMsurv::bmt data in R

[R code]
library(KMsurv)
?bmt

Analysis plan

  • We concentrate for now on disease-free survival (t2 and d3) for the three risk groups, ALL, AML Low Risk, and AML High Risk.
  • We will construct the Kaplan-Meier survival curves, compare them, and test for differences.
  • We will construct the cumulative hazard curves and compare them.
  • We will estimate the hazard functions, interpret, and compare them.

Survival Function Estimate and Variance

\[\hat{\text{S}}(t) = \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right]\] where \(Y_i\) is the group at risk at time \(t_i\).

The estimated variance of \(\hat{\text{S}}(t)\) is:

Theorem 9 (Greenwood’s estimator for variance of Kaplan-Meier survival estimator) \[ \widehat{\text{Var}}{\left(\hat{\text{S}}(t)\right)} = \hat{\text{S}}(t)^2\sum_{t_i <t}\frac{d_i}{Y_i(Y_i-d_i)} \tag{3}\]

We can use Equation 3 for confidence intervals for a survival function or a difference of survival functions.

Kaplan-Meier survival curves

code to preprocess and model bmt data
library(KMsurv)
library(survival)
data(bmt)

bmt <-
  bmt |>
  as_tibble() |>
  mutate(
    group =
      group |>
      factor(
        labels = c("ALL", "Low Risk AML", "High Risk AML")
      ),
    surv = Surv(t2, d3)
  )

km_model1 <- survfit(
  formula = surv ~ group,
  data = bmt
)
[R code]
library(ggfortify)
autoplot(
  km_model1,
  conf.int = TRUE,
  ylab = "Pr(disease-free survival)",
  xlab = "Time since transplant (days)"
) +
  theme_bw() +
  theme(legend.position = "bottom")
Disease-Free Survival by Disease Group

Disease-Free Survival by Disease Group

Understanding Greenwood’s formula (optional)

\[ \begin{aligned} \hat{\text{S}}(t) &=\prod_{t_i<t}x_i/Y_i\\ \text{S}(t)&=\prod_{t_i<t}p_i \end{aligned} \]

\[ \begin{aligned} \frac{\hat{\text{S}}(t)}{\text{S}(t)} &= \prod_{t_i<t} \frac{x_i}. {p_iY_i} \\ &= \prod_{t_i<t} \frac{\hat p_i}{p_i} \\ &= \prod_{t_i<t} {\left(1+\frac{\hat p_i-p_i}{p_i}\right)} \\ &\approx 1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i} \end{aligned} \]

\[ \begin{aligned} \text{Var}\left(\frac{\hat{\text{S}}(t)}{\text{S}(t)}\right) &\approx \text{Var}\left(1+\sum_{t_i<t} \frac{\hat p_i-p_i}{p_i}\right) \\ &=\sum_{t_i<t} \frac{1}{p_i^2}\frac{p_i(1-p_i)}{Y_i} \\ &= \sum_{t_i<t} \frac{(1-p_i)}{p_iY_i} \\ &\approx\sum_{t_i<t} \frac{(1-x_i/Y_i)}{x_i} \\ &=\sum_{t_i<t} \frac{Y_i-x_i}{x_iY_i} \\ &=\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} \\ \therefore\text{Var}{\left(\hat{\text{S}}(t)\right)} &\approx \hat{\text{S}}(t)^2\sum_{t_i<t} \frac{d_i}{Y_i(Y_i-d_i)} \end{aligned} \]

Test for differences among the disease groups

Here we compute a chi-square test for assocation between disease group (group) and disease-free survival:

[R code]
survdiff(surv ~ group, data = bmt)
#> Call:
#> survdiff(formula = surv ~ group, data = bmt)
#> 
#>                      N Observed Expected (O-E)^2/E (O-E)^2/V
#> group=ALL           38       24     21.9     0.211     0.289
#> group=Low Risk AML  54       25     40.0     5.604    11.012
#> group=High Risk AML 45       34     21.2     7.756    10.529
#> 
#>  Chisq= 13.8  on 2 degrees of freedom, p= 0.001

Cumulative Hazard

\[ \begin{aligned} {\lambda}(t) &\stackrel{\text{def}}{=}\text{p}(T=t|T\ge t)\\ &= \frac{\text{p}(T=t)}{\text{P}(T\ge t)}\\ &= -\frac{\partial}{\partial t}\text{log}{\left\{\text{S}(t)\right\}} \end{aligned} \]

The cumulative hazard (or integrated hazard) function is

\[{\Lambda}(t)\stackrel{\text{def}}{=}\int_0^t {\lambda}(t) dt\]

Since \({\lambda}(t) = -\frac{\partial}{\partial t}\text{log}{\left\{\text{S}(t)\right\}}\) as shown above, we have:

\[ {\Lambda}(t)=-\text{log}{\left\{\text{S}(t)\right\}} \]

So we can estimate \({\Lambda}(t)\) as:

\[ \begin{aligned} \hat {\Lambda}(t) &= -\text{log}{\left\{\hat{\text{S}}(t)\right\}}\\ &= -\text{log}{\left\{\prod_{t_i < t}{\left[1-\frac{d_i}{Y_i}\right]}\right\}}\\ &= -\sum_{t_i < t}\text{log}{\left\{1-\frac{d_i}{Y_i}\right\}} \end{aligned} \]

This is the Kaplan-Meier (product-limit) estimate of cumulative hazard.

Example: Cumulative Hazard Curves for Bone-Marrow Transplant (bmt) data

[R code]
autoplot(
  fun = "cumhaz",
  km_model1,
  conf.int = FALSE,
  ylab = "Cumulative hazard (disease-free survival)",
  xlab = "Time since transplant (days)"
) +
  theme_bw() +
  theme(legend.position = "bottom")
Figure 10: Disease-Free Cumulative Hazard by Disease Group

1.11 Nelson-Aalen Estimates of Cumulative Hazard and Survival

Definition 7 (Nelson-Aalen Cumulative Hazard Estimator)  

\[\hat{{\Lambda}}_{NA}(t) \stackrel{\text{def}}{=}\sum_{{\left\{i:\ t_i < t\right\}}}\hat{\lambda}_i\]

Theorem 10 (Variance of Nelson-Aalen estimator)  

\[ \begin{aligned} \hat{\text{Var}}\left(\hat H_{NA} (t)\right) &= \sum_{t_i <t}\frac{(d_i/Y_i)(1-d_i/Y_i)}{Y_i}\\ &\approx \sum_{t_i <t}\frac{d_i}{Y_i^2} \end{aligned} \tag{4}\]

Since \(\text{S}(t)=\text{exp}\left\{-{\Lambda}(t)\right\}\), the Nelson-Aalen cumulative hazard estimate can be converted into an alternate estimate of the survival function:

\[ \begin{aligned} \hat S_{NA}(t) &= \text{exp}\left\{-\hat H_{NA}(t)\right\}\\ &= \text{exp}\left\{-\sum_{t_i < t}\frac{d_i}{Y_i}\right\}\\ &= \prod_{t_i < t}\text{exp}\left\{-\frac{d_i}{Y_i}\right\}\\ \end{aligned} \]

Compare these with the corresponding Kaplan-Meier estimates:

\[ \begin{aligned} \hat H_{KM}(t) &= -\sum_{t_i < t}\text{log}\left\{1-\frac{d_i}{Y_i}\right\}\\ \hat S_{KM}(t) &= \prod_{t_i < t}\left[1-\frac{d_i}{Y_i}\right] \end{aligned} \]

Application to bmt dataset

[R code]
na_fit <- survfit(
  formula = surv ~ group,
  type = "fleming-harrington",
  data = bmt
)

km_fit <- survfit(
  formula = surv ~ group,
  type = "kaplan-meier",
  data = bmt
)

km_and_na <-
  bind_rows(
    .id = "model",
    "Kaplan-Meier" = km_fit |> fortify(surv.connect = TRUE),
    "Nelson-Aalen" = na_fit |> fortify(surv.connect = TRUE)
  ) |>
  as_tibble()
[R code]
km_and_na |>
  ggplot(aes(x = time, y = surv, col = model)) +
  geom_step() +
  facet_grid(. ~ strata) +
  theme_bw() +
  ylab("S(t) = P(T>=t)") +
  xlab("Survival time (t, days)") +
  theme(legend.position = "bottom")
Kaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group

Kaplan-Meier and Nelson-Aalen Survival Function Estimates, stratified by disease group

The Kaplan-Meier and Nelson-Aalen survival estimates are very similar for this dataset.

Copelan, Edward A, James C Biggs, James M Thompson, Pamela Crilley, Jeff Szer, John P Klein, Neena Kapoor, Belinda R Avalos, Isabel Cunningham, and Kerry Atkinson. 1991. “Treatment for Acute Myelocytic Leukemia with Allogeneic Bone Marrow Transplantation Following Preparation with BuCy2.” https://doi.org/10.1182/blood.V78.3.838.838.
Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.
Kalbfleisch, John D, and Ross L Prentice. 2011. The Statistical Analysis of Failure Time Data. John Wiley & Sons.
Klein, John P, and Melvin L Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.
Kleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.
Rothman, Kenneth J., Timothy L. Lash, Tyler J. VanderWeele, and Sebastien Haneuse. 2021. Modern Epidemiology. Fourth edition. Philadelphia: Wolters Kluwer.
Vittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.